Ascertaining the Francophone population in Ontario: validating the language variable in health data

Background Language barriers can impact health care and outcomes. Valid and reliable language data is central to studying health inequalities in linguistic minorities. In Canada, language variables are available in administrative health databases; however, the validity of these variables has not been studied. This study assessed concordance between language variables from administrative health databases and language variables from the Canadian Community Health Survey (CCHS) to identify Francophones in Ontario. Methods An Ontario combined sample of CCHS cycles from 2000 to 2012 (from participants who consented to link their data) was individually linked to three administrative databases (home care, long-term care [LTC], and mental health admissions). In total, 27,111 respondents had at least one encounter in one of the three databases. Language spoken at home (LOSH) and first official language spoken (FOLS) from CCHS were used as reference standards to assess their concordance with the language variables in administrative health databases, using the Cohen kappa, sensitivity, specificity, positive predictive value (PPV), and negative predictive values (NPV). Results Language variables from home care and LTC databases had the highest agreement with LOSH (kappa = 0.76 [95%CI, 0.735–0.793] and 0.75 [95%CI, 0.70–0.80], respectively) and FOLS (kappa = 0.66 for both). Sensitivity was higher with LOSH as the reference standard (75.5% [95%CI, 71.6–79.0] and 74.2% [95%CI, 67.3–80.1] for home care and LTC, respectively). With FOLS as the reference standard, the language variables in both data sources had modest sensitivity (53.1% [95%CI, 49.8–56.4] and 54.1% [95%CI, 48.3–59.7] in home care and LTC, respectively) but very high specificity (99.8% [95%CI, 99.7–99.9] and 99.6% [95%CI, 99.4–99.8]) and predictive values. The language variable from mental health admissions had poor agreement with all language variables in the CCHS. 
Conclusions Language variables in home care and LTC health databases were most consistent with the language often spoken at home. Studies using language variables from administrative data can use the sensitivity and specificity reported from this study to gauge the level of mis-ascertainment error and the resulting bias. Supplementary Information The online version contains supplementary material available at 10.1186/s12874-024-02220-7.


Introduction
In recent years, studies have provided evidence for the existence of health disparities across linguistic groups in Canada [1,2]. However, most studies relied on census and survey data to examine the disparities by language characteristics [3][4][5]. Administrative health databases are widely used to assess health and health care disparities, but the availability and quality of the language information is a barrier to performing health research on linguistic groups in Canada [6,7]. Methodological challenges have hindered further research on the relationship between linguistic factors and health outcomes. Quality issues arising from collection methods, type of language recorded, and access to data prevent researchers from further exploring how linguistic factors affect health care and outcomes [8][9][10]. Some studies have used language variables collected in healthcare databases; however, since their validity has never been formally assessed, the use of these variables has been limited and has generated conflicting results [8,9,11].

Language variables and linguistic groups
Linguistic groups are usually defined through language variables, either by a single variable that represents a simple linguistic concept (e.g., mother tongue, language most often spoken at home [LOSH], language of preference, etc.) or a combination of multiple variables (e.g., First Official Language Spoken [FOLS], which is derived from Mother Tongue, Knowledge of Canadian Official Languages and LOSH) [12]. Many of these language variables are routinely collected in the census and in surveys (e.g., the Canadian Community Health Survey [CCHS]) in Canada and less often in administrative health databases.
Mother tongue, LOSH and, increasingly, FOLS are the language variables most commonly used in Canada to define and describe the characteristics of linguistic groups and to conduct comparative analyses in many studies, including those focusing on healthcare [9, 12-15]. FOLS, which is defined within the framework of the Official Languages Act and represents a combination of several language variables, is increasingly being used in analyses and reports by Statistics Canada [14,16,17]. FOLS is valuable for research purposes because it establishes linguistic groups denoting Canada's two official languages (English and French) while also including persons whose mother tongue is neither English nor French but who use one or both of these languages on a regular basis. Francophones are a linguistic minority outside Quebec. In Ontario, francophones make up about 4% of the population, and research shows that francophone Ontarians face important health inequalities [5,18,19], but most of the analyses use survey data and only a few studies have used health data to identify the linguistic groups [11, 20-22]. However, no previous study has examined the validity of the language information in administrative health data. Thus, we used several health databases from Ontario to assess the validity of their language variables for identifying francophones in health research.
This study sought to determine the ability to ascertain Francophones in Ontario using administrative health databases. Specifically, we assessed measures of validity for the language variables in administrative health databases in identifying francophones, against a national survey standard, the CCHS, and determined the language concept captured by these variables.

Methods
The study used a data linkage of Ontario combined samples of the CCHS, from cycle 1.1 (2000-2001) to 2012, that were securely linked to three administrative health databases using anonymized and unique encoded identifiers and analyzed in a secure environment at ICES (https://www.ices.on.ca/; formerly Institute for Clinical Evaluative Sciences).

Data sources
The study population included Ontario respondents to the CCHS, from cycle 1.1 (2000-2001) to the 2012 cycle, aged 20 years and older who: (1) agreed to have their survey responses shared with the provinces and linked to their health care data (approximately 85% of participants) and (2) were eligible for Ontario's universal health insurance plan (OHIP). The CCHS is a cross-sectional, nationally representative survey that collects information related to health status, health care utilization and health determinants of the Canadian population aged 12 years or older living in private dwellings in all provinces and territories. To the best of our knowledge, there are no systematic differences between participants in CCHS who provided consent to link their data and those who did not.
Thus, to create the study dataset, the CCHS samples for Ontario (cycle 1.1 to the 2012 cycle) were combined. Then, the CCHS combined dataset was linked to three health databases that contain language information: the Continuing Care Reporting System (CCRS), which collects population-based resident information of patients receiving 24-hour nursing care in publicly funded residential long-term care; the Home Care Reporting System (HCRS), which comprises data collected with the Resident Assessment Instrument-Home Care (RAI-HC) on adults expected to receive home care services for at least six months; and the Ontario Mental Health Reporting System (OMHRS), which collects data on patients admitted to inpatient mental health services. Eligible participants were identified using OHIP and the Registered Persons Database (RPDB) and were linked over the same period covered by the survey. Each dataset used in the study is described in Appendix 1.
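The individual linkage described above can be pictured as an inner join on the encoded identifier. The following is a minimal sketch only; the field names (`ikn`, `losh`, `fols`, `primary_language`) and the records are hypothetical and do not reflect the actual ICES data holdings or linkage procedure.

```python
# Hypothetical records: field names and values are illustrative only.
cchs = [
    {"ikn": "A1", "losh": "French", "fols": "French"},
    {"ikn": "B2", "losh": "English", "fols": "French"},
    {"ikn": "C3", "losh": "English", "fols": "English"},
]
hcrs = [
    {"ikn": "A1", "primary_language": "French"},
    {"ikn": "B2", "primary_language": "English"},
]


def link(survey_rows, admin_rows, key="ikn"):
    """Inner-join survey and administrative rows on an encoded identifier.

    If an identifier appears more than once in admin_rows, the last
    record wins; a real linkage would need an explicit rule for that.
    """
    admin_by_id = {row[key]: row for row in admin_rows}
    linked = []
    for row in survey_rows:
        match = admin_by_id.get(row[key])
        if match is not None:
            # Keep survey and administrative fields side by side.
            linked.append({**row, **match})
    return linked


linked = link(cchs, hcrs)
for row in linked:
    print(row["ikn"], row["fols"], row["losh"], row["primary_language"])
```

Only respondents with at least one administrative record remain after the join, which mirrors how the linked study sample is smaller than the combined survey sample.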

Reference standard
Although there is no consensus regarding a reference standard for evaluating the quality of administrative data [23], numerous studies have used data from nationally representative surveys that provide accurate estimates of population characteristics, such as the CCHS, to validate administrative data in ascertaining chronic conditions (e.g., diabetes, hypertension, osteoporosis) [24][25][26][27][28][29][30][31]. Language variables collected in self-report surveys (e.g., Census, CCHS) are more explicitly defined than those in administrative databases. The CCHS includes original language variables (e.g., mother tongue, LOSH, and knowledge of official languages) and derived variables, such as FOLS, which are based on two or more language variables. Despite minor modifications to variable definitions since the inception of the CCHS, these variables provide accurate estimates of the linguistic characteristics of the Canadian population [19,32,33].
Given the validity of nationally representative surveys conducted by Statistics Canada, we used two language variables from the CCHS as the reference standard measures to assess the capacity of health data to ascertain the French-speaking population: LOSH, an original variable collected in the survey, and FOLS, a variable derived from knowledge of official languages, mother tongue, and LOSH [34]. Non-response for the language variables in the CCHS was low across cycles (< 5%), ranging from 0.2 to 2.7%. The proportion of missing values in health data was also lower than 5%. We did not exclude the records with missing values for these variables and made no imputations.

From CCHS:
- Mother tongue
- Language spoken most often at home (LOSH)
- Knowledge of official languages
- Language of conversation
- Language of interview
- Language of preference
- Language spoken to a doctor
- First official language spoken (FOLS)

From administrative health databases (language variable label):
- HCRS (Primary Language)
- CCRS (Primary language spoken at home on a regular basis)
- OMHRS (Language)

Administrative data and language information
The three administrative health databases (CCRS, HCRS and OMHRS) containing language information were used to identify Francophones. Without a clear and specific language definition, administrative health databases may be subject to interviewer bias (i.e., the interviewer may assume the respondent's language without explicitly asking for this information). Thus, the language variables from CCHS were used as the reference standard to validate the language variables in the health data. There are several language variables included in the survey (see Appendix 2), but LOSH and FOLS were used for the validity analysis.
The CCHS language variables mother tongue, LOSH and language of conversation allowed us to derive knowledge of official languages and FOLS, following Statistics Canada's definition [34]. Details on the collection of language variables are provided in Appendix 2.
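As an illustration of how a derived variable such as FOLS combines its components, the sketch below applies a simplified stepwise rule: knowledge of official languages is consulted first, then mother tongue, then language spoken at home, and the first component that points to exactly one official language decides. This is an assumption-laden simplification for exposition; the function name, the set-based encoding and the handling of residual cases are hypothetical and are not the authors' SAS code or the full Statistics Canada specification [34].

```python
OFFICIAL = {"English", "French"}


def derive_fols(knowledge, mother_tongue, home_language):
    """Simplified stepwise FOLS assignment (illustrative only).

    Each argument is a set of languages; non-official languages are
    ignored, so an empty intersection means "neither official language".
    """
    for answer in (knowledge, mother_tongue, home_language):
        langs = set(answer) & OFFICIAL
        if len(langs) == 1:
            # Exactly one official language: this component decides.
            return next(iter(langs))
    # No component was decisive: fall back on knowledge of both or neither.
    if set(knowledge) & OFFICIAL == OFFICIAL:
        return "English and French"
    return "Neither"


# Bilingual respondent, French mother tongue, English spoken at home.
print(derive_fols({"English", "French"}, {"French"}, {"English"}))
# Bilingual respondent, non-official mother tongue, French spoken at home.
print(derive_fols({"English", "French"}, {"Arabic"}, {"French"}))
```

Under this rule a bilingual respondent with French as mother tongue is assigned to the French group even when English is the language spoken at home, which is one way FOLS and LOSH can disagree for the same person.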
Although it is possible to make population estimates using CCHS survey weights, in this study we reported unweighted values, which were used to perform the individual data linkage and the analyses.

Analysis
Descriptive analyses of the language variables in all databases were performed. First, a frequency analysis of all language variables was conducted, and the proportion of participants in each linguistic group was reported. We provide a covariate description of the sample stratified by language group (i.e., francophones) and by age group, sex, rural/urban area of residence, marital and immigrant status, education and income levels. Second, the linked data set was used to evaluate the concordance of the language variables in identifying francophones by performing an agreement analysis using Cohen's kappa coefficient, which is a widely used measure of concordance between assessors and indicates the proportion of agreement beyond that expected by chance [35]. The levels of agreement for kappa were considered poor (κ < 0.20), fair (κ = 0.20 to 0.39), moderate (κ = 0.40 to 0.59), good (κ = 0.60 to 0.79), or very good (κ = 0.80 to 1.00) [25,36]. Next, validity analyses were performed to determine the language concept captured by the language variables in administrative data. The validity of the language variables in administrative health data for identifying francophones was assessed by calculating the sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) [36,37] using FOLS and LOSH as reference standards. All analyses were conducted using SAS 9.4 (SAS Institute, Inc., Cary, NC).
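The agreement and validity measures named above all follow from a 2x2 cross-tabulation of the administrative variable against the reference standard. The sketch below is written in Python purely for illustration (the study itself used SAS 9.4); the function names and the counts are hypothetical toy values, not study data.

```python
def two_by_two(admin, reference):
    """Count (TP, FP, FN, TN) for paired booleans, True = Francophone."""
    pairs = list(zip(admin, reference))
    tp = sum(a and r for a, r in pairs)
    fp = sum(a and not r for a, r in pairs)
    fn = sum(not a and r for a, r in pairs)
    tn = sum(not a and not r for a, r in pairs)
    return tp, fp, fn, tn


def validity(tp, fp, fn, tn):
    """Cohen's kappa, sensitivity, specificity, PPV and NPV from counts.

    Assumes every denominator is non-zero (both classes are present).
    """
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement from the marginal totals of both sources.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    return {
        "kappa": (observed - expected) / (1 - expected),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def kappa_label(kappa):
    """Agreement category using the thresholds stated in the text."""
    for upper, label in ((0.20, "poor"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "good")):
        if kappa < upper:
            return label
    return "very good"


# Toy counts, invented for illustration.
measures = validity(tp=40, fp=5, fn=20, tn=935)
print({name: round(value, 3) for name, value in measures.items()})
print(kappa_label(measures["kappa"]))
```

With a rare category such as Francophones in Ontario, specificity and NPV can be very high even when a third of true cases are missed, which is why sensitivity and kappa are reported alongside them.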
This project was approved by ICES' Privacy and Compliance Office. ICES is a prescribed entity under Section 45 of Ontario's Personal Health Information Protection Act, which does not require review by a Research Ethics Board.

Results
The combined CCHS sample consisted of 198,509 respondents, whose provincial health card numbers allowed individual linkage to Ontarians within the CCRS (n = 212,954 individuals), HCRS (n = 716,698 individuals) and OMHRS (n = 233,408 individuals). The linked dataset consisted of individuals who participated in at least one CCHS cycle and who were captured in at least one of the three administrative health databases during the timespan of the CCHS cycles (2000-2012). The final study sample consisted of 27,111 CCHS respondents who received home care services (HCRS) or long-term care services (CCRS) or were admitted to an inpatient mental health service (OMHRS) (Fig. 1). A summary of the characteristics of these databases by language group is provided in Table S1 in the supplementary material.
Table 1 presents the unweighted frequencies for the characteristics of the 198,509 respondents from the combined CCHS cycles, and the characteristics of francophones identified by FOLS and LOSH (a weighted sample is presented in Table S2, Appendix 3).
Within the study sample, 6.3% were French speakers by mother tongue, 6.0% were identified as Francophone by FOLS, and 3.6% reported using French as the language often spoken at home (Table 2 and Table S3, Appendix 3). Fewer than 2% of respondents conducted the interview in French or indicated French as their preferred language for the interview. Even fewer respondents (1.8%) reported speaking French with their doctor. Based on the language variables in administrative health databases, long-term care data (CCRS) identified the largest proportion of French speakers (3.2%), followed by home care data using the HCRS (2.8%).
The analysis of the levels of concordance between the two data sources (self-report surveys and administrative health databases) showed that the language variables in the health data from home care and long-term care had the highest agreement with LOSH (kappa = 0.76 [0.73-0.79] and 0.75 [0.70-0.80], respectively) (Table 3). The language variables from these two databases (HCRS and CCRS) also held a high level of agreement with FOLS (kappa = 0.66 [0.61-0.71] for both). The language variable in OMHRS (mental health) had poor agreement with the language variables from survey data.

Discussion
This study sought to assess the validity of language variables in administrative health databases by comparing language variables recorded in these databases to language data from the CCHS (i.e., LOSH and FOLS), which was taken to be the reference standard. Agreement and validity analyses were carried out, with the objective of identifying Francophones in Ontario in administrative data. Language variables from home care and long-term care data had the highest level of agreement with LOSH and FOLS, while the language variable from mental health admissions had poor agreement with the language variables in the CCHS.
While "primary language" is the language variable most commonly used to collect information in healthcare settings [8,38,39], the definition varies across databases and across the healthcare literature; some studies define primary language to be analogous to the language most commonly used (e.g., at home, at school, at work) [8,40], while others consider the respondent's first language learned (or mother tongue) to be their primary language [41][42][43]. The results of this study suggest that the linguistic concept captured by the language variables in both home care and long-term care databases is most similar to LOSH, which showed the highest level of agreement with the language information from home care and LTC settings (kappa = 0.76 and 0.75, respectively). Health care professionals who perform the interviews for home care using the HCRS are encouraged to "observe and listen" to the patient and their family to identify the patient's primary language and to determine the need for an interpreter [44]. Thus, it is not surprising that the language variable in home care and long-term care databases (HCRS and CCRS) corresponds to the language that the patient most commonly uses to communicate in their own home. This definition of primary language (i.e., language most commonly used either at home or on a day-to-day basis) is similar to that used in previous healthcare studies performed with administrative data [45][46][47].
This study found a high level of concordance between language variables in administrative databases (HCRS and CCRS) when using FOLS as the reference standard. There was very high specificity for ascertaining Francophones when comparing administrative health databases to both LOSH and FOLS, but sensitivity was higher when compared against LOSH. These findings suggest that the administrative health databases captured home care and long-term care recipients whose LOSH was French but often missed those whose FOLS was French, which is consistent with the finding of a higher proportion of French speakers with FOLS. Furthermore, given that mother tongue, a component of FOLS, captured the greatest number of Francophones, it is likely that FOLS identified Francophones by mother tongue who no longer speak French on a regular basis at home. In addition, the higher level of bilingualism among francophones might lead many of them to report English as their main language when seeking and receiving care in Ontario. Doing so may offer francophones some advantage in a linguistic minority context, when services in French are not available or when they experience discrimination or lower quality of care.
The very high predictive values for both CCRS and HCRS implied that participants identified as Francophones in the administrative health databases are very likely to have self-identified as Francophones in the CCHS. Coding errors in administrative health databases may partly account for low sensitivity. It is also possible that administrative health data captured individuals who are fully bilingual and comfortable seeking care from English providers (and thus more likely to be coded as Anglophone), whereas survey data may have included more unilingual Francophones (or Francophones with low English proficiency), who are less likely to seek healthcare services in Ontario [48,49], which are generally provided in English.

Table 2 (excerpt): Language often spoken at home (LOSH), 6,040 (3.6%); Knowledge of Official Languages (KOL), 128 (0.4%); First Official Language Spoken (FOLS), 10,036.
Interestingly, the rate of bilingualism is higher among Francophones than Anglophones [50], which is consistent with the finding of very high specificity and very high negative predictive value. In other words, some Francophones were identified as Anglophones, but very few Anglophones were identified as Francophones. Overall, these results highlight the importance of individual language preference for multilingual patients when seeking care, which may depend on the context (e.g., interpreter use, bilingual provider), as shown in other studies [41,51].
Concordance and sensitivity for identifying Francophones were very low for the OMHRS database. The poor concordance for the language variable in the database related to mental health hospitalizations (i.e., OMHRS) may be related to data entry errors and underreporting. Unlike home care and long-term care assessments, which are performed in the outpatient setting, data for OMHRS are collected in acute care settings. As such, it is likely that interviewers spend less time performing assessments for OMHRS because of competing tasks (e.g., admission documents, clinical care) that must also be performed simultaneously. Furthermore, since the OMHRS captures patients admitted to inpatient mental health hospitals, patients may not be able to provide accurate information due to an underlying mental health disorder (e.g., depression, mania, psychosis). In these situations, reported answers may be influenced by an accompanying person or may be assumed by the interviewer. These factors could bias the interviewers to report the patient's language as English since it is the most common language at most hospitals in Ontario.
Despite the high concordance of primary language captured in administrative health databases with the language reported in survey data, these results do not imply that there is a single approach to identifying linguistic groups. The approach to selecting the most appropriate language variable for a study should be guided by the design and research question of the study [7,12], since these elements can impact the language concept of interest. For example, researchers examining the impacts of language barriers may choose a variable that identifies people who can and cannot speak a given language, while researchers studying disparities across ethnolinguistic groups may select a variable such as mother tongue to identify all members of the group in question.
The study design should also be taken into account when performing validation studies of language variables in other administrative databases. Researchers should carefully consider the linguistic concept in the context of the proposed research question while also examining the quality of the administrative data to determine the optimal reference standard for validation. For example, FOLS, which creates linguistic groups denoting Canada's two official languages (English and French), may not be relevant when studying minority groups other than Francophones, which consist of a higher proportion of individuals who speak neither English nor French. In such instances, language variables such as LOSH or mother tongue may be more suitable.

Strengths and limitations
For this study, two language variables from a self-report survey (CCHS) were used as the reference standard for respondents' language. This reference standard, which has not been validated to our knowledge, is subject to self-reporting bias since respondents may overestimate or underestimate their language proficiency. However, self-reported data have previously been used in validation studies of other administrative databases [24][25][26]. Moreover, the CCHS is a nationally representative survey that provides robust cross-sectional estimates of sociodemographic and health characteristics of the Canadian population [52]. Finally, the proportions of Francophones and other linguistic groups by mother tongue, LOSH and FOLS from the CCHS are consistent with those obtained from census data [14].
Nevertheless, there may remain response bias in the CCHS, as some bilingual participants may have reported English or French as the language often spoken at home despite speaking both languages on a regular basis. Contextual factors may also influence an individual's decision to report his or her primary language in administrative health databases. Since English is the most common language in Ontario, Francophones who also speak English may have reported their primary language as English because they perceived this answer to be more favorable (social desirability bias). This factor may have led to an underestimation of the number of Francophones identified by administrative health databases.
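One way to gauge the size of such under-ascertainment is to adjust an observed proportion for the known sensitivity and specificity of the classifier. The sketch below uses the Rogan-Gladen estimator, a standard misclassification correction that is not part of this study's analysis; the function name is hypothetical, and the inputs combine figures reported in this article (HCRS proportion of French speakers 2.8%, sensitivity 53.1% and specificity 99.8% against FOLS) purely as an illustration.

```python
def corrected_prevalence(apparent, sensitivity, specificity):
    """Rogan-Gladen estimate of the true proportion.

    apparent: proportion classified as Francophone by the imperfect variable.
    The estimate is clipped to the [0, 1] range.
    """
    denominator = sensitivity + specificity - 1
    if denominator <= 0:
        raise ValueError("classifier is no better than chance")
    estimate = (apparent + specificity - 1) / denominator
    return min(1.0, max(0.0, estimate))


# Illustrative inputs taken from values reported in this article.
estimate = corrected_prevalence(apparent=0.028,
                                sensitivity=0.531,
                                specificity=0.998)
print(f"apparent 2.8% -> corrected {estimate:.1%}")
```

With these inputs the corrected proportion is about 4.9%, noticeably above the apparent 2.8%. The adjustment treats sensitivity and specificity as fixed and transferable to the target population, which may not hold in other settings.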

Conclusions and implications
To our knowledge, no previous study has examined the agreement between language variables in survey data and administrative health databases. This study revealed that language variables in administrative health databases of home care and long-term care have a high level of concordance with LOSH and FOLS and, thus, can be used to reliably identify linguistic groups for the purpose of performing research to assess the impact of language factors on health outcomes. However, caution must be exercised when using language variables collected from acute care settings (such as OMHRS), as these variables may be less reliable. These results suggest that the language concept captured by administrative health databases, particularly from home care and long-term care data, is most similar to language spoken at home. Reporting guidelines recommend that studies using routinely collected data report potential measurement error and how it may bias the study's findings [37]. The findings from this study can be used for this purpose.

[...] by an annual grant from the Ontario MOHLTC. ICES has been approved by Ontario's Information and Privacy Commissioner since 2005. The opinions, results, and conclusions reported in this article are those of the authors and are independent from the funding sources. No endorsement by ICES or the Ontario MOHLTC is intended or should be inferred. ICES collects information most notably for purposes of Section 45 of Ontario's Personal Health Information Protection Act (PHIPA). RB is supported by the Institut du Savoir Montfort. Tanuseputro is supported by a PSI Graham Farquharson Knowledge Translation Fellowship.

Fig. 1 Sample size from each data source and linked sample. CCRS: Continuing Care Reporting System. HCRS: Home Care Reporting System. OMHRS: Ontario Mental Health Reporting System. CCHS: Canadian Community Health Survey

Fig. 2 Sensitivity, specificity and predictive values of language variables in administrative health data (n = 27,111). Sens: sensitivity; Spec: specificity; PPV: positive predictive value; NPV: negative predictive value; CCRS: Continuing Care Reporting System; HCRS: Home Care Reporting System; OMHRS: Ontario Mental Health Reporting System

Table 2 Frequency of Francophones by type of language variable from each data source

Table 3
CCHS: Canadian Community Health Survey, CCRS: Continuing Care Reporting System, HCRS: Home Care Reporting System, OMHRS: Ontario Mental Health Reporting System. *KOL was only available in the 2011/2012 cycle. (-) No valid records for estimation